{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Case Study: Debugging in Regression.ipynb",
      "version": "0.3.2",
      "provenance": [
        {
          "file_id": "1K6dyxkBfrXVxppRzGHUnMFnK2Otnlp5i",
          "timestamp": 1541016529771
        }
      ],
      "collapsed_sections": [
        "vrr7QBAu_oeC",
        "fko5u9Y3hiKW",
        "nWgwvf4qODQE",
        "w8qvAbVIsL4w",
        "_prjsbFc71O8",
        "SDLggOvfqFbe",
        "syvONtOvrkBV",
        "Nm1xp9DjCejl",
        "uoUVsCIsYBVC",
        "eClRDcGVudk2"
      ],
      "last_runtime": {
        "build_target": "//learning/brain/python/client:colab_notebook",
        "kind": "private"
      }
    },
    "kernelspec": {
      "display_name": "Python 2",
      "name": "python2"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "vrr7QBAu_oeC"
      },
      "source": [
        "#### Copyright 2018 Google LLC."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "both",
        "colab_type": "code",
        "id": "hMqWDc_m6rUC",
        "colab": {}
      },
      "source": [
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "iMDBCBXpyfjb"
      },
      "source": [
        "#  Case Study: Debugging in Regression\n",
        "\n",
        "In this Colab, you will learn how to debug a regression problem through a case study. You will:\n",
        "\n",
        "1. Set up the problem.\n",
        "2. Interpret the correlation matrix.\n",
        "3. Implement linear and nonlinear models.\n",
        "4. Compare and choose between linear and nonlinear models.\n",
        "5. Optimize your chosen model.\n",
        "6. Debug your chosen model.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "YO9VBlPg-gQi"
      },
      "source": [
        "# Setup\n",
        "\n",
        "This Colab uses the wine quality dataset<sup>[1]</sup>, which is hosted at [UCI](https://archive.ics.uci.edu/ml/datasets/wine+quality). This dataset contains data on the physicochemical properties of wine along with wine quality ratings. The problem is to predict wine quality (0-10) from physicochemical properties.\n",
        "\n",
        "Please **make a copy** of this Colab before running it. Click on *File*, and then click on *Save a copy in Drive*.\n",
        "\n",
        "<small>[1] Modeling wine preferences by data mining from physicochemical properties. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Decision Support Systems, Elsevier, 47(4):547-553, 2009.</small>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "TD2JCwZopqaP"
      },
      "source": [
        "Load libraries and data by running the next cell. Display the first few rows to verify that the dataset loaded correctly."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "both",
        "colab_type": "code",
        "id": "1tdRTwl6sN9A",
        "colab": {}
      },
      "source": [
        "# Reset environment for a new run\n",
        "% reset -f\n",
        "\n",
        "# Load libraries\n",
        "from os.path import join # for joining file pathnames\n",
        "import pandas as pd\n",
        "import seaborn as sns\n",
        "import tensorflow as tf\n",
        "from tensorflow import keras\n",
        "import numpy as np\n",
        "import matplotlib.pyplot as plt\n",
        "\n",
        "# Set Pandas display options\n",
        "pd.options.display.max_rows = 10\n",
        "pd.options.display.float_format = '{:.1f}'.format\n",
        "\n",
        "wineDf = pd.read_csv(\n",
        "  \"https://download.mlcc.google.com/mledu-datasets/winequality.csv\",\n",
        "  encoding='latin-1')\n",
        "wineDf.columns = ['fixed acidity','volatile acidity','citric acid',\n",
        "                     'residual sugar','chlorides','free sulfur dioxide',\n",
        "                     'total sulfur dioxide','density','pH',\n",
        "                     'sulphates','alcohol','quality']\n",
        "wineDf.head()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "FKNBH65pM8q9"
      },
      "source": [
        "# Check Correlation Matrix"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "6mfGqkcgSHYC"
      },
      "source": [
        "Before developing your ML model, you need to select features. To find informative features, check the correlation matrix by running the following cell. Which features are informative?"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "HX36srdIthHj",
        "colab": {}
      },
      "source": [
        "corr_wineDf = wineDf.corr()\n",
        "plt.figure(figsize=(16,10))\n",
        "sns.heatmap(corr_wineDf, annot=True)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "DqM4CCQJ4d7N"
      },
      "source": [
        "`alcohol` is most highly correlated with `quality`. Looking for other informative features, notice that `volatile acidity` correlates with `quality` but not with `alcohol`, making it a good second feature. Remember that a correlation matrix is not helpful if predictive signals are encoded in combinations of features."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "l6IsKBBard07"
      },
      "source": [
        "# Validate Input Data against Data Schema"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "5QBbI6SBrg1k"
      },
      "source": [
        "Before processing your data, you should validate the data against a data schema as described in [Data and Feature Debugging](https://developers.google.com/machine-learning/testing-debugging/common/data-errors).\n",
        "\n",
        "First, define a function that validates data against a schema."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "colab_type": "code",
        "id": "NpLYlvpQsHoD",
        "colab": {}
      },
      "source": [
        "#@title Define function to validate data\n",
        "\n",
        "def test_data_schema(input_data, schema):\n",
        "  \"\"\"Tests that the datatypes and ranges of values in the dataset\n",
        "    adhere to expectations.\n",
        "\n",
        "    Args:\n",
        "      input_function: Dataframe containing data to test\n",
        "      schema: Schema which describes the properties of the data.\n",
        "  \"\"\"\n",
        "\n",
        "  def test_dtypes():\n",
        "    for column in schema.keys():\n",
        "      assert input_data[column].map(type).eq(\n",
        "          schema[column]['dtype']).all(), (\n",
        "          \"Incorrect dtype in column '%s'.\" % column\n",
        "      )\n",
        "    print('Input dtypes are correct.')\n",
        "\n",
        "  def test_ranges():\n",
        "    for column in schema.keys():\n",
        "      schema_max = schema[column]['range']['max']\n",
        "      schema_min = schema[column]['range']['min']\n",
        "      # Assert that data falls between schema min and max.\n",
        "      assert input_data[column].max() <= schema_max, (\n",
        "          \"Maximum value of column '%s' is too low.\" % column\n",
        "      )\n",
        "      assert input_data[column].min() >= schema_min, (\n",
        "          \"Minimum value of column '%s' is too high.\" % column\n",
        "      )\n",
        "    print('Data falls within specified ranges.')\n",
        "\n",
        "  test_dtypes()\n",
        "  test_ranges()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "z6Frne4HTspm"
      },
      "source": [
        "To define your schema, you need to understand the statistical properties of your dataset. Generate statistics on your dataset by running the following code cell:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "WT4dm-ALr-gD",
        "colab": {}
      },
      "source": [
        "wineDf.describe()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "6rSPgXENscc8"
      },
      "source": [
        "Using the statistics generated above, define the data schema in the following code cell. For demonstration purposes, restrict your data schema to the first three data columns. For each data column, enter the:\n",
        "\n",
        " * minimum value\n",
        " * maximum value\n",
        " * data type\n",
        "\n",
        "As an example, the values for the first column are filled out. After entering the values, run the code cell to confirm that your input data matches the schema."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "YQ5CUFlsshm8",
        "colab": {}
      },
      "source": [
        "wine_schema = {\n",
        "    'fixed acidity': {\n",
        "        'range': {\n",
        "            'min': 3.8,\n",
        "            'max': 15.9\n",
        "        },\n",
        "        'dtype': float,\n",
        "    },\n",
        "    'volatile acidity': {\n",
        "        'range': {\n",
        "            'min': , # describe() rounds up this value, be careful\n",
        "            'max':\n",
        "        },\n",
        "        'dtype': ,\n",
        "    },\n",
        "    'citric acid': {\n",
        "        'range': {\n",
        "            'min': ,\n",
        "            'max':\n",
        "        },\n",
        "        'dtype': ,\n",
        "    }\n",
        "}\n",
        "\n",
        "print('Validating wine data against data schema...')\n",
        "test_data_schema(wineDf, wine_schema)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "fko5u9Y3hiKW"
      },
      "source": [
        "## Solution"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "L_fiqb-whjWu",
        "colab": {}
      },
      "source": [
        "wine_schema = {\n",
        "    'fixed acidity': {\n",
        "        'range': {\n",
        "            'min': 3.7,\n",
        "            'max': 15.9\n",
        "        },\n",
        "        'dtype': float,\n",
        "    },\n",
        "    'volatile acidity': {\n",
        "        'range': {\n",
        "            'min': 0.08,  # minimum value\n",
        "            'max': 1.6   # maximum value\n",
        "        },\n",
        "        'dtype': float,    # data type\n",
        "    },\n",
        "    'citric acid': {\n",
        "        'range': {\n",
        "            'min': 0.0, # minimum value\n",
        "            'max': 1.7  # maximum value\n",
        "        },\n",
        "        'dtype': float,   # data type\n",
        "    }\n",
        "}\n",
        "\n",
        "print('Validating wine data against data schema...')\n",
        "test_data_schema(wineDf, wine_schema)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Ks-c6qeRNDbN"
      },
      "source": [
        "# Split and Normalize Data"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "iqz_CWW-1nuC"
      },
      "source": [
        "Split the dataset into data and labels."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "BstIdTwr2A8J",
        "colab": {}
      },
      "source": [
        "wineFeatures = wineDf.copy(deep=True)\n",
        "wineFeatures.drop(columns='quality',inplace=True)\n",
        "wineLabels = wineDf['quality'].copy(deep=True)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "zV8hRjJHNJcd"
      },
      "source": [
        "Normalize data using z-score."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "8eWEEY2N8jvM",
        "colab": {}
      },
      "source": [
        "def normalizeData(arr):\n",
        "  stdArr = np.std(arr)\n",
        "  meanArr = np.mean(arr)\n",
        "  arr = (arr-meanArr)/stdArr\n",
        "  return arr\n",
        "\n",
        "for str1 in wineFeatures.columns:\n",
        "   wineFeatures[str1] = normalizeData(wineFeatures[str1])"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "FRYctLRsxoKQ"
      },
      "source": [
        "# Test Engineered Data"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "O9FmtMP7xyfW"
      },
      "source": [
        "After normalizing your data, you should test your engineered data for errors as described in [Data and Feature Debugging](https://developers.google.com/machine-learning/testing-debugging/common/data-errors). In this section, you will test that engineering data:\n",
        "\n",
        "* Has the expected number of rows and columns.\n",
        "* Does not have null values.\n",
        "\n",
        "First, set up the testing functions by running the following code cell:\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "both",
        "colab_type": "code",
        "id": "OCkV2ti1ygqe",
        "colab": {}
      },
      "source": [
        "import unittest\n",
        "\n",
        "def test_input_dim(df, n_rows, n_columns):\n",
        "  assert len(df) == n_rows, \"Unexpected number of rows.\"\n",
        "  assert len(df.columns) == n_columns, \"Unexpected number of columns.\"\n",
        "  print('Engineered data has the expected number of rows and columns.')\n",
        "\n",
        "def test_nulls(df):\n",
        "  dataNulls = df.isnull().sum().sum()\n",
        "  assert dataNulls == 0, \"Nulls in engineered data.\"\n",
        "  print('Engineered features do not contain nulls.')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "TrCc2NAE4J5A"
      },
      "source": [
        "Your input data had 6497 examples and 11 feature columns. Test whether your engineered data has the expected number of rows and columns by running the following cell. Confirm that the test fails if you change the values below."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "colab_type": "code",
        "id": "97BqVwqx4JKQ",
        "colab": {}
      },
      "source": [
        "#@title Test dimensions of engineered data\n",
        "wine_feature_rows = 6497 #@param\n",
        "wine_feature_cols = 11 #@param\n",
        "test_input_dim(wineFeatures,\n",
        "               wine_feature_rows,\n",
        "               wine_feature_cols)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Gb7CxKrN5PDq"
      },
      "source": [
        "Test that your engineered data does not contain nulls by running the code below."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "Gg3krAmq4aBf",
        "colab": {}
      },
      "source": [
        "test_nulls(wineFeatures)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "eaU8hlK-rzbc"
      },
      "source": [
        "# Check Splits for Statistical Equivalence\n",
        "\n",
        "As described in the [Data Debugging](https://developers.google.com/machine-learning/testing-debugging/common/data-errors) guidelines, before developing your model, you should check that your training and validation splits are equally representative. Assuming a training:validation split of 80:20, compare the mean and the standard deviation of the splits by running the next two code cells. Note that this comparison is not a rigorous test for statistical equivalence but simply a quick and dirty comparison of the splits."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "GBeiaCDUshkS",
        "colab": {}
      },
      "source": [
        "splitIdx = wineFeatures.shape[0]*8/10\n",
        "wineFeatures.iloc[0:splitIdx,:].describe()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "BvG69ze-uS2g",
        "colab": {}
      },
      "source": [
        "wineFeatures.iloc[splitIdx:-1,:].describe()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "ETUVOTizsgUa"
      },
      "source": [
        "The two splits are clearly not equally representative. To make the splits equally representative, you can shuffle the data.\n",
        "\n",
        "Run the following code cell to shuffle the data, and then recreate the features and labels from the shuffled data."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "l6G6xnJdu3CF",
        "colab": {}
      },
      "source": [
        "# Shuffle data\n",
        "wineDf = wineDf.sample(frac=1).reset_index(drop=True)\n",
        "# Recreate features and labels\n",
        "wineFeatures = wineDf.copy(deep=True)\n",
        "wineFeatures.drop(columns='quality',inplace=True)\n",
        "wineLabels = wineDf['quality'].copy(deep=True)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "-BbK2JuTvYZw"
      },
      "source": [
        "Now, confirm that the splits are equally representative by regenerating and comparing the statistics using the previous code cells. You may wonder why the initial splits differed so greatly. It turns out that in the wine dataset, the first 4897 rows contain data on white wines and the next 1599 rows contain data on red wines. When you split your dataset 80:20, then your training dataset contains 5197 examples, which is 94% white wine. The validation dataset is purely red wine. \n",
        "\n",
        "Ensuring your splits are statistically equivalent is a good development practice. In general, following good development practices will simplify your model debugging. To learn about testing for statistical equivalence, see [Equivalence Tests Lakens, D.](https://journals.sagepub.com/doi/10.1177/1948550617697177)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "TH7y19etJQyd"
      },
      "source": [
        "# Establish a Baseline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "zj7hlj7yJSU2"
      },
      "source": [
        "For a regression problem, the simplest baseline to predict the average value. Run the following code to calculate the mean-squared error (MSE) loss on the training split using the average value as a baseline. Your loss is approximately 0.75. Any model should beat this loss to justify its use."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "ryhxiL7qJak8",
        "colab": {}
      },
      "source": [
        "baselineMSE = np.square(wineLabels[0:splitIdx]-np.mean(wineLabels[0:splitIdx]))\n",
        "baselineMSE = np.sum(baselineMSE)/len(baselineMSE)\n",
        "print(baselineMSE)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "vjRt-zgo5a1G"
      },
      "source": [
        "# Linear Model\n",
        "\n",
        "Following good ML dev practice, let's start with a linear model that uses the most informative feature from the correlation matrix: `alcohol`. Even if this model performs badly, we can still use it as a baseline. This model should beat our previous baseline's MSE of 0.75."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "X1REI7glC8GA"
      },
      "source": [
        "First, let's define a function to plot our loss and accuracy curves. The function will also print the final loss and accuracy. Instead of using `verbose=1`, you can call the function."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "l-qlE8TCC8j3",
        "colab": {}
      },
      "source": [
        "def showRegressionResults(trainHistory):\n",
        "  \"\"\"Function to:\n",
        "   * Print final loss.\n",
        "   * Plot loss curves.\n",
        "  \n",
        "  Args:\n",
        "    trainHistory: object returned by model.fit\n",
        "  \"\"\"\n",
        "  \n",
        "  # Print final loss\n",
        "  print(\"Final training loss: \" + str(trainHistory.history['loss'][-1]))\n",
        "  print(\"Final Validation loss: \" + str(trainHistory.history['val_loss'][-1]))\n",
        "  \n",
        "  # Plot loss curves\n",
        "  plt.plot(trainHistory.history['loss'])\n",
        "  plt.plot(trainHistory.history['val_loss'])\n",
        "  plt.legend(['Training loss','Validation loss'],loc='best')\n",
        "  plt.title('Loss Curves')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "gSZZjpYnDLiP"
      },
      "source": [
        "For fast prototyping, let's try using a full batch per epoch to update the gradient only once per  epoch. Use the full batch by setting `batch_size = wineFeatures.shape[0]` as indicated by the code comment.\n",
        "\n",
        "What do you think of the loss curve? Can you improve it? For hints and discussion, see the following text cells."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "wHUSu92-0q8-",
        "colab": {}
      },
      "source": [
        "model = None\n",
        "# Choose feature\n",
        "wineFeaturesSimple = wineFeatures['alcohol']\n",
        "# Define model\n",
        "model = keras.Sequential()\n",
        "model.add(keras.layers.Dense(units=1, activation='linear', input_dim=1))\n",
        "# Specify the optimizer using the TF API to specify the learning rate\n",
        "model.compile(optimizer=tf.train.AdamOptimizer(learning_rate=0.01),\n",
        "              loss='mse')\n",
        "# Train the model!\n",
        "trainHistory = model.fit(wineFeaturesSimple,\n",
        "                         wineLabels,\n",
        "                         epochs=50,\n",
        "                         batch_size=, # set batch size here\n",
        "                         validation_split=0.2,\n",
        "                         verbose=0)\n",
        "# Plot\n",
        "showRegressionResults(trainHistory)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "nWgwvf4qODQE"
      },
      "source": [
        "## Hint"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "hzYID6wZ3Cff"
      },
      "source": [
        "The loss decreases but very slowly. Possible fixes are:\n",
        "\n",
        "* Increase number of epochs.\n",
        "* Increase learning rate.\n",
        "* Decrease batch size. A lower batch size can result in larger decrease in loss per epoch, under the assumption that the smaller batches stay representative of the overall data distribution.\n",
        "\n",
        "Play with these three parameters in the code above to decrease the loss."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "w8qvAbVIsL4w"
      },
      "source": [
        "## Solution"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "gx8cQD8VsZtI"
      },
      "source": [
        "Run the following code cell to train the model using a reduced batch size of 100. Reducing the batch size leads to a greater decrease in loss per epoch. The minimum achievable loss is about 0.64. This is a significant increase over our baseline of 0.75."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "IDNZcb2cb_Tp",
        "colab": {}
      },
      "source": [
        "model = None\n",
        "# Choose feature\n",
        "wineFeaturesSimple = wineFeatures['alcohol']\n",
        "# Define model\n",
        "model = keras.Sequential()\n",
        "model.add(keras.layers.Dense(units=1, activation='linear', input_dim=1))\n",
        "# Specify the optimizer using the TF API to specify the learning rate\n",
        "model.compile(optimizer=tf.train.AdamOptimizer(learning_rate=0.01),\n",
        "              loss='mse')\n",
        "# Train the model!\n",
        "trainHistory = model.fit(wineFeaturesSimple,\n",
        "                         wineLabels,\n",
        "                         epochs=20,\n",
        "                         batch_size=100, # set batch size here\n",
        "                         validation_split=0.2,\n",
        "                         verbose=0)\n",
        "# Plot\n",
        "showRegressionResults(trainHistory)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "pkXqclbwANES"
      },
      "source": [
        "# Add Feature to Linear Model\n",
        "\n",
        "Try adding a feature to the linear model. Since you need to combine the two features into one prediction for regression, you'll also need to add a second layer. Modify the code below to implement the following changes:\n",
        "\n",
        "1. Add `'volatile acidity'` to the features in `wineFeaturesSimple`.\n",
        "1. Add a second linear layer with 1 unit.\n",
        "1. Experiment with learning rate, epochs, and batch_size to try to reduce loss.\n",
        "\n",
        "What happens to your loss?"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "9l2vVIiz-06P",
        "colab": {}
      },
      "source": [
        "model = None\n",
        "# Select features\n",
        "wineFeaturesSimple = wineFeatures[['alcohol', '...']] # add 'volatile acidity'\n",
        "# Define model\n",
        "model = keras.Sequential()\n",
        "model.add(keras.layers.Dense(wineFeaturesSimple.shape[1],\n",
        "                             input_dim=wineFeaturesSimple.shape[1],\n",
        "                             activation='linear'))\n",
        "model.add(...) # add second layer\n",
        "# Compile\n",
        "model.compile(optimizer=tf.train.AdamOptimizer(learning_rate=), loss='mse')\n",
        "# Train\n",
        "trainHistory = model.fit(wineFeaturesSimple,\n",
        "                         wineLabels,\n",
        "                         epochs=,\n",
        "                         batch_size=,\n",
        "                         validation_split=0.2,\n",
        "                         verbose=0)\n",
        "# Plot results\n",
        "showRegressionResults(trainHistory)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "_prjsbFc71O8"
      },
      "source": [
        "## Solution"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "895HsCHl7-Qk"
      },
      "source": [
        "Run the following code to add the second feature and the second layer. The training loss is about 0.59, a small decrease from the previous loss of 0.64."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "Ai06p1DEbadf",
        "colab": {}
      },
      "source": [
        "model = None\n",
        "# Select features\n",
        "wineFeaturesSimple = wineFeatures[['alcohol', 'volatile acidity']] # add second feature\n",
        "# Define model\n",
        "model = keras.Sequential()\n",
        "model.add(keras.layers.Dense(wineFeaturesSimple.shape[1],\n",
        "                             input_dim=wineFeaturesSimple.shape[1],\n",
        "                             activation='linear'))\n",
        "model.add(keras.layers.Dense(1, activation='linear')) # add second layer\n",
        "# Compile\n",
        "model.compile(optimizer=tf.train.AdamOptimizer(learning_rate=0.01), loss='mse')\n",
        "# Train\n",
        "trainHistory = model.fit(wineFeaturesSimple,\n",
        "                         wineLabels,\n",
        "                         epochs=20,\n",
        "                         batch_size=100,\n",
        "                         validation_split=0.2,\n",
        "                         verbose=0)\n",
        "# Plot results\n",
        "showRegressionResults(trainHistory)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "AvY20_w6Awz7"
      },
      "source": [
        "# Use a Nonlinear Model\n",
        "\n",
        "Let's try a nonlinear model. Modify the code below to make the following changes:\n",
        "\n",
        "1. Change the first layer to use `relu`. (Output layer stays linear since this is a regression problem.)\n",
        "1. As usual, specify the learning rate, epochs, and batch_size.\n",
        "\n",
        "Run the cell. Does the loss increase, decrease, or stay the same?"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "YyH3Y_ycAv2x",
        "colab": {}
      },
      "source": [
        "model = None\n",
        "# Define\n",
        "model = keras.Sequential()\n",
        "model.add(keras.layers.Dense(wineFeaturesSimple.shape[1],\n",
        "                             input_dim=wineFeaturesSimple.shape[1],\n",
        "                             activation=))\n",
        "model.add(keras.layers.Dense(1, activation='linear'))\n",
        "# Compile\n",
        "model.compile(optimizer=tf.train.AdamOptimizer(), loss='mse')\n",
        "# Fit\n",
        "model.fit(wineFeaturesSimple,\n",
        "          wineLabels,\n",
        "          epochs=,\n",
        "          batch_size=,\n",
        "          validation_split=0.2,\n",
        "          verbose=0)\n",
        "# Plot results\n",
        "showRegressionResults(trainHistory)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "SDLggOvfqFbe"
      },
      "source": [
        "## Solution"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "sl8Ft3FrqGjl"
      },
      "source": [
        "Run the following cell to use a `relu` activation in your first hidden layer. Your loss stays about the same, perhaps declining negligibly to 0.58."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "HgsUJNaTdjDb",
        "colab": {}
      },
      "source": [
        "model = None\n",
        "# Define\n",
        "model = keras.Sequential()\n",
        "model.add(keras.layers.Dense(wineFeaturesSimple.shape[1],\n",
        "                             input_dim=wineFeaturesSimple.shape[1],\n",
        "                             activation='relu'))\n",
        "model.add(keras.layers.Dense(1, activation='linear'))\n",
        "# Compile\n",
        "model.compile(optimizer=tf.train.AdamOptimizer(), loss='mse')\n",
        "# Fit\n",
        "model.fit(wineFeaturesSimple,\n",
        "          wineLabels,\n",
        "          epochs=20,\n",
        "          batch_size=100,\n",
        "          validation_split=0.2,\n",
        "          verbose=0)\n",
        "# Plot results\n",
        "showRegressionResults(trainHistory)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Ef5AXAcjBoHb"
      },
      "source": [
        "# Optimize Your Model\n",
        "\n",
        "We have two features with one hidden layer but didn't see an improvement. At this point, it's tempting to use all your features with a high-capacity network. However, you must resist the temptation. Instead, follow the guidance in [Model Optimization](https://developers.google.com/machine-learning/testing-debugging/common/optimization) to improve model performance. For a hint and for a discussion, see the following text sections."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "y07BO1feBnT0",
        "colab": {}
      },
      "source": [
        "# Choose features\n",
        "wineFeaturesSimple = wineFeatures[['alcohol', 'volatile acidity']] # add features\n",
        "# Define\n",
        "model = None\n",
        "model = keras.Sequential()\n",
        "model.add(keras.layers.Dense(wineFeaturesSimple.shape[1],\n",
        "                             activation='relu',\n",
        "                             input_dim=wineFeaturesSimple.shape[1]))\n",
        "# Add more layers here\n",
        "model.add(keras.layers.Dense(1,activation='linear'))\n",
        "# Compile\n",
        "model.compile(optimizer=tf.train.AdamOptimizer(), loss='mse')\n",
        "# Train\n",
        "trainHistory = model.fit(wineFeaturesSimple,\n",
        "                         wineLabels,\n",
        "                         epochs=,\n",
        "                         batch_size=,\n",
        "                         validation_split=0.2,\n",
        "                         verbose=0)\n",
        "# Plot results\n",
        "showRegressionResults(trainHistory)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "syvONtOvrkBV"
      },
      "source": [
        "## Hint"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "cy-trUPfrk9X"
      },
      "source": [
        "You can try to reduce loss by adding features, adding layers, or playing with the hyperparameters. Before adding more features, check the correlation matrix. Don't expect your loss to decrease by much. Sadly, that is a common experience in machine learning!"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Nm1xp9DjCejl"
      },
      "source": [
        "## Solution"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "H_URHTCL9QiX"
      },
      "source": [
        "Run the following code to:\n",
        "\n",
        "* Add the features chlorides and density.\n",
        "* Set training epochs to 100.\n",
        "* Set batch size to 100.\n",
        "\n",
        "Your loss reduces to about 0.56. That's a minor improvement over the previous loss of 0.58. It seems that adding more features or capacity isn't improving your model by much. Perhaps your model has a bug? In the next section, you will run a sanity check on your model."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "E87kEZADeJFT",
        "colab": {}
      },
      "source": [
        "# Choose features\n",
        "wineFeaturesSimple = wineFeatures[['alcohol','volatile acidity','chlorides','density']]\n",
        "# Define\n",
        "model = None\n",
        "model = keras.Sequential()\n",
        "model.add(keras.layers.Dense(wineFeaturesSimple.shape[1],\n",
        "                             activation='relu',\n",
        "                             input_dim=wineFeaturesSimple.shape[1]))\n",
        "# Add more layers here\n",
        "model.add(keras.layers.Dense(1,activation='linear'))\n",
        "# Compile\n",
        "model.compile(optimizer=tf.train.AdamOptimizer(), loss='mse')\n",
        "# Train\n",
        "trainHistory = model.fit(wineFeaturesSimple,\n",
        "                         wineLabels,\n",
        "                         epochs=200,\n",
        "                         batch_size=100,\n",
        "                         validation_split=0.2,\n",
        "                         verbose=0)\n",
        "# Plot results\n",
        "showRegressionResults(trainHistory)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "gy6CkyIZSMN3"
      },
      "source": [
        "# Check for Implementation Bugs using Reduced Dataset\n",
        "\n",
        "Your loss isn't decreasing by much. Perhaps your model has an implementation bug. From the [Model Debugging](https://developers.google.com/machine-learning/testing-debugging/common/model-errors) guidelines, a quick test for implementation bugs is to obtain a low loss on a reduced dataset of, say, 10 examples. Remember, passing this test does not validate your modeling approach but only checks for basic implementation bugs. In your ML problem, if your model passes this test, then continue debugging your model to train on your full dataset.\n",
        "\n",
        "In the following code, experiment with the learning rate, batch size, and number of epochs. Can you reach a low loss? Choose hyperparameter values that let you iterate quickly."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "A0WLrVA7SUJe",
        "colab": {}
      },
      "source": [
        "# Choose 10 examples\n",
        "wineFeaturesSmall = wineFeatures[0:10]\n",
        "wineLabelsSmall = wineLabels[0:10]\n",
        "# Define model\n",
        "model = None\n",
        "model = keras.Sequential()\n",
        "model.add(keras.layers.Dense(wineFeaturesSmall.shape[1],\n",
        "                             activation='relu',\n",
        "                             input_dim=wineFeaturesSmall.shape[1]))\n",
        "model.add(keras.layers.Dense(wineFeaturesSmall.shape[1], activation='relu'))\n",
        "model.add(keras.layers.Dense(1, activation='linear'))\n",
        "# Compile\n",
        "model.compile(optimizer=tf.train.AdamOptimizer(), loss='mse') # set LR\n",
        "# Train\n",
        "trainHistory = model.fit(wineFeaturesSmall,\n",
        "                        wineLabelsSmall,\n",
        "                        epochs=,\n",
        "                        batch_size=,\n",
        "                        verbose=0)\n",
        "# Plot results\n",
        "print(\"Final training loss: \" + str(trainHistory.history['loss'][-1]))\n",
        "plt.plot(trainHistory.history['loss'])"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "uoUVsCIsYBVC"
      },
      "source": [
        "## Solution"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Z5TWFqrZ-oyp"
      },
      "source": [
        "Run the following code cell to train the model using these hyperparameter values:\n",
        "\n",
        "* learning rate = 0.01\n",
        "* epochs = 200\n",
        "* batch size = 10\n",
        "\n",
        "You get a low loss on your reduced dataset. This result means your model is probably solid and your previous results are as good as they'll get."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "xzIcE6j_fQNf",
        "colab": {}
      },
      "source": [
        "# Choose 10 examples\n",
        "wineFeaturesSmall = wineFeatures[0:10]\n",
        "wineLabelsSmall = wineLabels[0:10]\n",
        "# Define model\n",
        "model = None\n",
        "model = keras.Sequential()\n",
        "model.add(keras.layers.Dense(wineFeaturesSmall.shape[1], activation='relu',\n",
        "                             input_dim=wineFeaturesSmall.shape[1]))\n",
        "model.add(keras.layers.Dense(wineFeaturesSmall.shape[1], activation='relu'))\n",
        "model.add(keras.layers.Dense(1, activation='linear'))\n",
        "# Compile\n",
        "model.compile(optimizer=tf.train.AdamOptimizer(0.01), loss='mse') # set LR\n",
        "# Train\n",
        "trainHistory = model.fit(wineFeaturesSmall,\n",
        "                        wineLabelsSmall,\n",
        "                        epochs=200,\n",
        "                        batch_size=10,\n",
        "                        verbose=0)\n",
        "# Plot results\n",
        "print(\"Final training loss: \" + str(trainHistory.history['loss'][-1]))\n",
        "plt.plot(trainHistory.history['loss'])"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "LK-CuHasqxt6"
      },
      "source": [
        "# Trying a Very Complex Model"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "BLiaqT8AKZ8o"
      },
      "source": [
        "Let's go all in and use a very complex model with all the features. For science! And to satisfy ourselves that a simple model is indeed better. Let's use all 11 features with 3 fully-connected relu layers and a final linear layer. The next cell takes a while to run. Skip to the results in the cell after if you like."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "I5D1IyEZq1Ej",
        "colab": {}
      },
      "source": [
        "model = None\n",
        "# Define\n",
        "model = keras.Sequential()\n",
        "model.add(keras.layers.Dense(wineFeatures.shape[1], activation='relu',\n",
        "                             input_dim=wineFeatures.shape[1]))\n",
        "model.add(keras.layers.Dense(wineFeatures.shape[1], activation='relu'))\n",
        "model.add(keras.layers.Dense(wineFeatures.shape[1], activation='relu'))\n",
        "model.add(keras.layers.Dense(1,activation='linear'))\n",
        "# Compile\n",
        "model.compile(optimizer=tf.train.AdamOptimizer(), loss='mse')\n",
        "# Train the model!\n",
        "trainHistory = model.fit(wineFeatures, wineLabels, epochs=100, batch_size=100,\n",
        "                         verbose=1, validation_split = 0.2)\n",
        "# Plot results\n",
        "showRegressionResults(trainHistory)\n",
        "plt.ylim(0.4,1)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "eClRDcGVudk2"
      },
      "source": [
        "## Results"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "nkM7MT8XLEd6"
      },
      "source": [
        "If you train for long enough, the minimum achievable MSE is around 0.52, which is a decrease of 0.04 from the previous best loss of 0.56. This decrease probably isn't worth the performance and complexity cost of using all your features and a deeper network. However, that decision depends on the objectives you're optimizing for.\n",
        "\n",
        "If you train the model long enough, the validation loss actually starts increasing while training loss continues to decrease. This divergence in loss curves means your model is overfitting. The overfitting results from the closer fit that your very complex model can learn. Stick with the simpler model. You'll be happier and live longer!\n",
        "\n",
        "If you do want to optimize your loss, then play with the model to find the minimum achievable training loss before overfitting sets in. Try playing with the network parameters in the code cell above to achieve a loss of 0.51. But be warned—optimizing your loss could take a lot of trial and error.."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "yT7i3BdqvzKO"
      },
      "source": [
        "# Conclusion\n",
        "\n",
        "This Colab demonstrated the following principles:\n",
        "\n",
        "* The most important step in machine learning is understanding your data.\n",
        "* The largest gains come from the initial features and network.\n",
        "* Returns diminish as you add features and complexity.\n",
        "* Incremental development provides confidence in model quality and allows benchmarking against previous results.\n",
        "* Reproducing previous results is extremely important. Hence, always use version control."
      ]
    }
  ]
}
